The objective of this model is to flag patients who will develop sepsis within 48 hours. The training data were developed from the MUSC 'Hourly Labs and Vitals' pipeline; details are in the repository README. A combination of categorical, semi-structured, and numeric data was used to build the model. All labs, vitals, and other features used to generate the model are listed below in the 'Variables Considered' section.
SIRS criteria are the clinical means by which nurses and physicians classify adult inpatients as having sepsis. Under SIRS criteria, a person is considered to have sepsis when they meet all of the following:
(1) Temperature > 100.9 °F or < 96.8 °F (and)
(2) Heart rate (HR) > 90 beats per minute (and)
(3) Respiration rate > 20 breaths per minute (and)
(4) White blood cell count > 12 or < 4 (x 10^3 cells/µL)
The flag for the logistic regression was generated by first identifying which patients satisfied SIRS criteria and when the criteria were met, then flagging the 48 hours of data prior to that reading.
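The flagging rule described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the record keys (`time`, `temp_f`, `hr`, `rr`, `wbc`) are hypothetical, and it assumes the window is anchored on the first reading that meets all four criteria and that the qualifying reading itself is not flagged.

```python
from datetime import datetime, timedelta

def meets_sirs(temp_f, heart_rate, resp_rate, wbc):
    """Return True when all four SIRS criteria listed above are met."""
    return (
        (temp_f > 100.9 or temp_f < 96.8)
        and heart_rate > 90
        and resp_rate > 20
        and (wbc > 12 or wbc < 4)
    )

def flag_prior_window(readings, window_hours=48):
    """Flag readings taken within `window_hours` before the first SIRS-positive reading.

    `readings` holds the records of ONE patient (hypothetical keys, see lead-in).
    Returns a list of 0/1 flags in the same order as the input.
    """
    onset_times = [
        r["time"] for r in readings
        if meets_sirs(r["temp_f"], r["hr"], r["rr"], r["wbc"])
    ]
    if not onset_times:
        return [0] * len(readings)  # patient never met the criteria
    onset = min(onset_times)
    window_start = onset - timedelta(hours=window_hours)
    # Assumption: the onset reading itself is excluded from the flagged window.
    return [1 if window_start <= r["time"] < onset else 0 for r in readings]

# Made-up readings for a single patient
t0 = datetime(2020, 1, 1, 0, 0)
readings = [
    {"time": t0, "temp_f": 98.6, "hr": 80, "rr": 16, "wbc": 7.0},
    {"time": t0 + timedelta(hours=30), "temp_f": 99.5, "hr": 95, "rr": 18, "wbc": 9.0},
    {"time": t0 + timedelta(hours=60), "temp_f": 101.5, "hr": 110, "rr": 24, "wbc": 14.0},
]
flags = flag_prior_window(readings)  # [0, 1, 0]
```

Only the reading at hour 30 falls inside the 48 hours before the criteria are met at hour 60, so it is the only one flagged.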
The text data were run through a tokenizer to build a bag-of-words representation, with stop words removed. The modeling technique used was gradient-boosted trees via the xgboost package. The data were split using a cutoff date: all observations after the cutoff date were placed in the test set. The models were built using two optimization criteria: maximizing AUC and minimizing misclassification log loss. A grid search over L1 and L2 regularization was used to optimize the model.
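Two of the steps above, the bag-of-words conversion and the cutoff-date split, can be sketched in plain Python. The stop-word list, the `obs_date` key, the cutoff date, and the example text values are all hypothetical; the report does not state which tokenizer, stop words, or cutoff were actually used.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical stop-word list; the list used for the model is not stated.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "on", "in", "to", "with"}

def bag_of_words(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words, count tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def split_by_cutoff(rows, cutoff, date_key="obs_date"):
    """Rows dated on or before `cutoff` go to train; later rows go to test."""
    train = [r for r in rows if r[date_key] <= cutoff]
    test = [r for r in rows if r[date_key] > cutoff]
    return train, test

# Made-up rows for illustration only
rows = [
    {"obs_date": date(2019, 11, 3), "o2_device": "Nasal cannula"},
    {"obs_date": date(2019, 12, 20), "o2_device": "Room air"},
    {"obs_date": date(2020, 1, 15), "o2_device": "Nasal cannula with humidification"},
]
train, test = split_by_cutoff(rows, cutoff=date(2019, 12, 31))
counts = bag_of_words(rows[2]["o2_device"])
```

Splitting on time rather than at random keeps every test observation later than every training observation, which mirrors how the model would be used on future patients.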
The following variables were used to build the model. The text columns are semi-structured and were converted to a bag-of-words representation. Categorical columns were one-hot encoded. All numeric variables were zero-imputed when missing.
{'cat_cols': ['MUSC R AS IV DEVICE WDL', 'PAT_SEX', 'FINANCIAL_CLASS_NAME', 'AGE_GROUP'],
 'cat_from_text_cols': ['CPM S16 R INV O2 DEVICE', 'CPM S16 R INV ISOLATION PRECAUTIONS'],
 'zero_imputer_cols': ['AGE', 'AVAILABLE_TIME_HOUR', 'AVAILABLE_TIME_MONTH', 'AVAILABLE_TIME_DAYOFWEEK', 'AVAILABLE_TIME_DAY', 'ICUSixMonths', 'PCPSixMonths', 'SurgerySixMonths', 'VenilatorSixMonths', 'AdmissionsSixMonths', 'MaxPreviousLOS'],
 'imputer_cols': ['GLUCOSE, WHOLE BLOOD', 'HEMOLYSIS INDEX', 'SODIUM', 'POTASSIUM', 'GLUCOSE', 'CREATININE', 'CHLORIDE', 'CALCIUM', 'CO2 CONTENT (BICARBONATE)', 'UREA NITROGEN, BLOOD (BUN)', 'ANION GAP', 'HEMATOCRIT', 'HEMOGLOBIN', 'PLATELET COUNT', 'RED BLOOD CELL COUNT', 'MEAN CORPUSCULAR HEMOGLOBIN', 'MEAN CORPUSCULAR HEMOGLOBIN CONC', 'MEAN CORPUSCULAR VOLUME', 'WHITE BLOOD CELL COUNT', 'RED CELL DISTRIBUTION WIDTH', 'MEAN PLATELET VOLUME', 'ICTERIC INDEX', 'MAGNESIUM', 'NUCLEATED RED BLOOD CELLS', 'PHOSPHORUS (PO4)', 'EGFR', 'BILIRUBIN, TOTAL', 'TOTAL PROTEIN', 'ALBUMIN', 'ASPARTATE AMINOTRANSFERASE (AST)(SGOT)', 'ALKALINE PHOSPHATASE', 'ALANINE AMINOTRANSFERASE (ALT)(SGPT)', 'FIO2, ARTERIAL', 'PO2 (CORR), ARTERIAL', 'pH (CORR), ARTERIAL', 'BICARB, ARTERIAL', 'PCO2 (CORR), ARTERIAL', 'BASE, ARTERIAL', 'O2 SAT, ARTERIAL', 'TOTAL CO2, ARTERIAL', 'PT TEMP (CORR), ARTERIAL', 'PROTHROMBIN TIME', 'INR', 'NEUTROPHILS ABSOLUTE COUNT', 'MONOCYTES RELATIVE PERCENT', 'LYMPHOCYTES ABSOLUTE COUNT', 'NEUTROPHILS RELATIVE PERCENT', 'LYMPHOCYTE RELATIVE PERCENT', 'MONOCYTES ABSOLUTE COUNT', 'EOSINOPHILS, ABSOLUTE COUNT', 'PULSE', 'PULSE OXIMETRY', 'RESPIRATIONS', 'TEMPERATURE', 'R MAP', 'CPM S16 R AS PAIN RATING (0-10): REST', 'R MAINTENANCE IV VOLUME', 'ORAL INTAKE', 'URINE OUTPUT', 'CPM S16 R AS SC BRADEN SCORE', 'MUSC R URINE OUTPUT (ML)', 'CPM F12 ROW TUBE FEEDING INTAKE (ML) (ADULT, NICU, OB, PEDIATRIC)', 'R MAP A-LINE', 'R MORSE FALL RISK SCORE', 'MUSC R GENERAL OUTPUT (ML)', 'CPM S16 R AS SC GLASGOW COMA SCALE SCORE', 'WEIGHT/SCALE', 'R IP FN WEIGHT CHANGE', 'MUSC IP CCPOT TOTAL SCORE', 'CPM S16 R AS SC NIPS SCORE', 'MUSC IP R AVPU (TRANSFORMED)', 'CPM S16 R AS SC ALDRETE SCORE', 'CPM S16 R AS CURRENT WEIGHT (GM) (PEDIATRIC)', 'CPM S16 R AS SC BRADEN Q SCORE', 'BLOOD PRESSURE (SYSTOLIC)', 'BLOOD PRESSURE (DIASTOLIC)', 'MUSC R SC PHLEBITIS IV DEVICE (TRANSFORMED)', 'MUSC R AS SC INFILTRATION IV DEVICE (TRANSFORMED)', 'R ARTERIAL LINE BLOOD PRESSURE (SYSTOLIC)', 'R ARTERIAL LINE BLOOD PRESSURE (DIASTOLIC)', 'CPM S16 R AS SC RASS (RICHMOND AGITATION-SEDATION SCALE) (TRANSFORMED)', 'R MUSC ED WISCONSIN SEDATION SCALE (TRANSFORMED)', 'MetTemp', 'MetHR', 'MetRR', 'MetWBC', 'MaxTemp8', 'MaxTemp24', 'MaxTemp48', 'MinTemp8', 'MinTemp24', 'MinTemp48', 'MaxHR8', 'MaxHR24', 'MaxHR48', 'MinHR8', 'MinHR24', 'MinHR48', 'MaxRR8', 'MaxRR24', 'MaxRR48', 'MinRR8', 'MinRR24', 'MinRR48', 'MaxWBC8', 'MaxWBC24', 'MaxWBC48', 'MinWBC8', 'MinWBC24', 'MinWBC48', 'DaysSinceLastAdmission']}
Number of features: 170
In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity) for different cutoff values. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
Sensitivity (true positive rate) is the proportion of cases where the model predicts positive given that the target variable is present. Specificity (true negative rate) is the proportion of cases where the model predicts negative given that the target variable is absent.
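To make the construction of the curve concrete, the sketch below computes one (false positive rate, true positive rate) point per threshold and the area under the resulting curve with the trapezoidal rule. The labels and scores are made-up toy values, not model output, and the function names are illustrative.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more cases as positive.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
area = auc(points)  # 0.75
```

Each point's x-coordinate is 1 minus the specificity at that threshold, and its y-coordinate is the sensitivity.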
The distribution of model predictions is shown below. The red line represents the proportion of cases where the target variable is present.
This plot displays the trade-off between cutoff value and accuracy. The optimal cutoff here is found by maximizing the model's accuracy. This is not ideal for problems with class imbalance; however, a better cutoff can be chosen when intervention costs are known.
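The weakness of an accuracy-maximizing cutoff under class imbalance can be seen in a small example. The data below are made up (one positive case among ten), and the costs of a false positive and a false negative are hypothetical placeholders; real values would have to come from the clinical setting.

```python
def accuracy_at(y_true, scores, cutoff):
    """Fraction classified correctly when predicting positive for score >= cutoff."""
    correct = sum(1 for y, s in zip(y_true, scores) if (s >= cutoff) == (y == 1))
    return correct / len(y_true)

def expected_cost(y_true, scores, cutoff, cost_fp, cost_fn):
    """Total cost of false positives and false negatives at a given cutoff."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 1)
    return cost_fp * fp + cost_fn * fn

# Imbalanced toy data: nine negatives, one positive (the last case)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.05, 0.08, 0.12, 0.15, 0.18, 0.22, 0.28, 0.45, 0.55, 0.42]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Cutoff chosen by accuracy alone (first candidate wins ties)
acc_cutoff = max(candidates, key=lambda c: accuracy_at(y_true, scores, c))

# Cutoff chosen by cost, assuming a miss is 10x as costly as a false alarm
cost_cutoff = min(
    candidates,
    key=lambda c: expected_cost(y_true, scores, c, cost_fp=1, cost_fn=10),
)
```

Here the accuracy-maximizing cutoff is 0.6, which reaches 90% accuracy by flagging no one and missing the only positive case, while the cost-based choice is 0.3, which catches the positive case at the price of two false alarms.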
Sensitivity (true positive rate) is the proportion of cases where the model predicts positive when the target variable is present. Specificity (true negative rate) is the proportion of cases where the model predicts negative when the target variable is absent. Positive predictive value is the proportion of cases where the target variable is present when the model predicts positive. Negative predictive value is the proportion of cases where the target variable is absent when the model predicts negative.
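The four quantities defined above follow directly from the counts in a confusion matrix. A minimal sketch with made-up labels and predictions (the function name is illustrative, and it assumes every denominator is non-zero):

```python
def confusion_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV and NPV from binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),  # positives caught among actual positives
        "specificity": tn / (tn + fp),  # negatives cleared among actual negatives
        "ppv": tp / (tp + fp),          # actual positives among predicted positives
        "npv": tn / (tn + fn),          # actual negatives among predicted negatives
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = confusion_metrics(y_true, y_pred)
```

With 2 true positives, 1 false negative, 1 false positive and 6 true negatives, sensitivity and PPV are both 2/3, and specificity and NPV are both 6/7.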
The variable importance plots display the most important features by their information gain, a measurement that sums up how much 'information' a feature gives about the target variable. Information gain measures the reduction in entropy, or uncertainty, accumulated over every split made on the given feature.
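As an illustration of the concept, the sketch below computes the entropy-based information gain of a single binary split. It follows the description above rather than any library's internals: the gain that xgboost reports is derived from its own training objective, so this shows the idea, not the exact calculation behind the plots. The labels and split masks are made up.

```python
from math import log2

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0  # a pure node carries no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(labels, goes_left):
    """Entropy before the split minus the size-weighted entropy after it."""
    left = [y for y, m in zip(labels, goes_left) if m]
    right = [y for y, m in zip(labels, goes_left) if not m]
    n = len(labels)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - after

labels = [1, 1, 0, 0]
perfect = information_gain(labels, [True, True, False, False])  # separates the classes
useless = information_gain(labels, [True, False, True, False])  # leaves both sides mixed
```

The perfect split removes all uncertainty (gain of 1 bit), while the uninformative split removes none (gain of 0). A feature's importance under this view is the total of such gains over all the splits that use it.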
SHapley Additive exPlanations (SHAP) is an approach to explaining the output of machine learning models. SHAP assigns a value to each feature for each prediction (i.e. feature attribution); the larger the magnitude of the value, the larger the feature’s attribution to the specific prediction. In classification, a positive SHAP value indicates that a factor increases the value of the model's prediction (risk), whereas a negative SHAP value indicates that a factor decreases the value of the model's prediction. The sum of SHAP values over all features, added to the base (expected) value, will approximately equal the model's output for each observation. In the following plots, blue points signify negative SHAP values, red points signify positive SHAP values, and yellow points are values for which the feature contributes little to the predicted value.
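The additive property can be checked on a tiny example by computing exact Shapley values through brute-force enumeration of feature coalitions. This is only a conceptual sketch: the toy model, its coefficients, and the all-zero baseline are hypothetical, and enumeration like this is far too slow for a model with 170 features, where SHAP relies on faster model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical risk score with one additive term and one interaction."""
    return 0.1 + 0.2 * z[0] + 0.3 * z[1] * z[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
reconstructed = toy_model(baseline) + sum(phi)
```

The first feature receives its full additive effect (0.2), the two interacting features share their joint effect equally (0.15 each), and the base value plus the three attributions reproduces the model output of 0.6.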
The following plot displays the effect of the top text/category values on the model's predictions. Red data points represent cases where the text/category feature is observed in a patient's records, whereas blue data points represent cases where the feature is not observed.